32 research outputs found

    Novel feature selection methods for high dimensional data

    [Abstract] Feature selection is defined as the process of detecting the relevant features and discarding the irrelevant ones, with the aim of obtaining a smaller subset of features that adequately describes the given problem with minimal degradation, or even an improvement, in performance. With the advent of high-dimensional datasets, in both samples and features, the proper identification of the relevant features has become indispensable in real-world scenarios. In this context, the available methods face a new challenge in terms of applicability and scalability, and new methods that take these particularities of high dimensionality into account need to be developed. This thesis is devoted to research in feature selection and to its application to real high-dimensional data. The first part of this work deals with the analysis of existing feature selection methods, checking their suitability against different challenges in order to provide new findings to feature selection researchers. To this end, the most popular techniques have been applied to real problems, aiming not only at performance improvements but also at enabling their application in real time. Besides efficiency, scalability is also a critical aspect in large-scale applications: the effectiveness of feature selection methods can be significantly degraded, if not rendered completely inapplicable, when the data size grows continuously, so the scalability of feature selection methods must also be analyzed. After carrying out an in-depth analysis of the existing feature selection methods, the second part of this thesis focuses on the development of new techniques. Since most existing selection methods require the data to be discrete, the first proposed approach combines a discretizer, a filter and a classifier, obtaining promising results in different scenarios. In an attempt to introduce diversity, the second proposal uses an ensemble of filters instead of a single one, freeing the user from having to decide which technique is the most adequate for a given problem. The third technique proposed in this thesis considers not only the relevance of the features but also their associated cost, economic or in terms of execution time, and presents a general framework for cost-based feature selection. Finally, several strategies to distribute and parallelize feature selection are proposed, since transforming a large-scale problem into several small-scale problems can lead to improvements in processing time and, sometimes, in classification accuracy.
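
    The thesis does not fix concrete implementations for the discretizer-filter-classifier combination it proposes; the following is a minimal sketch of that kind of pipeline in scikit-learn, where the specific components (KBinsDiscretizer, a mutual-information filter, a linear SVM) are illustrative assumptions rather than the thesis' exact choices:

```python
# Sketch of a discretizer -> filter -> classifier pipeline, as proposed in the
# thesis above. Component choices are illustrative, not the thesis' own.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("discretize", KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")),
    ("filter", SelectKBest(mutual_info_classif, k=10)),  # keep 10 most informative features
    ("classify", LinearSVC(dual=False)),
])

print(cross_val_score(pipe, X, y, cv=5).mean())
```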

    Feature Selection for Big Visual Data: Overview and Challenges

    International Conference on Image Analysis and Recognition (ICIAR 2018), Póvoa de Varzim, Portugal

    A scalable saliency-based feature selection method with instance-level information

    Classic feature selection techniques remove those features that are either irrelevant or redundant, yielding a subset of relevant features that helps to provide better knowledge extraction. This allows the creation of compact models that are easier to interpret. Most of these techniques work over the whole dataset, but they cannot provide the user with useful information when only instance-level information is needed. In short, given any example, classic feature selection algorithms give no information about which features are the most relevant for that particular sample. This work aims to overcome this limitation by developing a novel feature selection method, called Saliency-based Feature Selection (SFS), based on deep-learning saliency techniques. Our experimental results show that this algorithm can be successfully used not only with neural networks, but also with any architecture trained using gradient descent techniques
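
    The abstract does not give the exact SFS formulation; as an illustration of the saliency idea it builds on, the sketch below ranks the features of a single instance by the magnitude of the gradient of a model's output with respect to its input, using PyTorch autograd. The toy network and sizes are assumptions:

```python
# Illustrative gradient saliency for one instance: rank input features by
# |d output / d input|. This follows the general saliency idea the paper
# builds on, not necessarily the authors' exact SFS formulation.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 1))  # toy network

x = torch.randn(20, requires_grad=True)  # one sample with 20 features
score = model(x).sum()                   # scalar output so backward() works
score.backward()                         # populates x.grad

saliency = x.grad.abs()                  # per-feature relevance for this sample
top5 = torch.topk(saliency, k=5).indices
print("most relevant features for this instance:", top5.tolist())
```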

    Reduced precision discretization based on information theory

    [Abstract] In recent years, new technological areas have emerged and proliferated, such as the Internet of Things or embedded systems in drones, which are usually characterized by the use of devices with strict requirements on weight, size, cost and power consumption. As a consequence, there has been a growing interest in implementing machine learning algorithms with reduced precision that can be embedded in these constrained devices. These algorithms cover not only learning, but can also be applied to other stages such as feature selection or data discretization. In this work we study the behavior of the Minimum Description Length Principle (MDLP) discretizer, proposed by Fayyad and Irani, when reduced precision is used, and how much it affects a typical machine learning pipeline. Experimental results show that the use of fixed-point format is sufficient to achieve performances similar to those obtained with double-precision format, which opens the door to the use of reduced-precision discretizers in embedded systems, minimizing energy consumption and carbon emissions.
    Ministerio de Ciencia e Innovación; PID2019-109238GB-C2
    Xunta de Galicia; ED431C 2018/34
    Secretaría Xeral de Universidades; ED431G 2019/01
    Fundación BBVA; Ayudas a Equipos de Investigación Científica 2019
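
    As a rough illustration of the reduced-precision idea studied above (not the paper's implementation), the following sketch computes the entropy term used in MDLP-style cut-point evaluation with all probabilities and logarithms rounded to a 16-bit Q8.8 fixed-point grid; the Q-format and helper names are assumptions:

```python
# Rough sketch: Shannon entropy in fixed-point arithmetic, as used inside an
# MDLP-style cut-point evaluation. The Q8.8 format and helper names are
# assumptions; the paper evaluates several fixed-point configurations.
import math

FRAC_BITS = 8            # Q8.8: 16-bit values with 8 fractional bits
SCALE = 1 << FRAC_BITS

def to_fixed(x: float) -> int:
    return int(round(x * SCALE))

def to_float(q: int) -> float:
    return q / SCALE

def fixed_entropy(counts):
    """Entropy of a class-count vector, with every intermediate probability
    and log term rounded to the fixed-point grid."""
    total = sum(counts)
    h = 0
    for c in counts:
        if c == 0:
            continue
        p = to_fixed(c / total)                   # quantized probability
        log_p = to_fixed(math.log2(to_float(p)))  # quantized log2(p)
        h -= (p * log_p) >> FRAC_BITS             # fixed-point multiply
    return to_float(h)

print(fixed_entropy([30, 70]))   # ~0.887 vs ~0.8813 in double precision
```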

    On developing an automatic threshold applied to feature selection ensembles

    © 2019. This manuscript version is made available under the CC-BY-NC-ND 4.0 license https://creativecommons.org/licenses/by-nc-nd/4.0/. This version of the article (B. Seijo-Pardo, V. Bolón-Canedo, and A. Alonso-Betanzos, "On developing an automatic threshold applied to feature selection ensembles", Information Fusion, vol. 45, pp. 227-245, Jan. 2019) has been accepted for publication in Information Fusion. The Version of Record is available online at https://doi.org/10.1016/j.inffus.2018.02.007
    [Abstract] Feature selection ensemble methods are a recent approach aiming at adding diversity to the sets of selected features, improving performance and obtaining more robust and stable results. However, using an ensemble introduces the need for an aggregation step to combine the outputs of all the methods that form the ensemble. Besides, when trying to improve computational efficiency, ranking methods that order all the initial features are preferred, so an additional thresholding step is also mandatory. In this work two different ensemble designs based on ranking methods are described. The main difference between them is the order in which the combination and thresholding steps are performed. In addition, a new automatic threshold based on the combination of three data complexity measures is proposed and compared with traditional thresholding approaches based on retaining a fixed percentage of features. The behavior of these methods was tested, according to SVM classification accuracy, with satisfactory results, for three different scenarios: synthetic datasets and two types of real datasets (where the sample size is much larger than the feature size, and where the feature size is much larger than the sample size).
    This research has been financially supported in part by the Spanish Ministerio de Economía y Competitividad (research project TIN 2015-65069-C2-1-R), by the Xunta de Galicia (research projects GRC2014/035 and the Centro Singular de Investigación de Galicia, accreditation 2016–2019) and by the European Union (FEDER/ERDF).
    Xunta de Galicia; GRC2014/035
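
    A minimal sketch of the combine-then-threshold ensemble design described above, assuming mean-rank aggregation over three scikit-learn rankers and a fixed 25% cut; the paper's automatic threshold instead derives the cut from three data complexity measures:

```python
# Minimal sketch of a combine-then-threshold ranking ensemble. Mean-rank
# aggregation and the fixed 25% cut are illustrative stand-ins for the
# aggregation and automatic thresholding options studied in the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=40, random_state=0)
X = X - X.min(axis=0)                    # chi2 requires non-negative inputs

ranks = []
for score_func in (f_classif, chi2, mutual_info_classif):
    out = score_func(X, y)
    scores = out[0] if isinstance(out, tuple) else out
    ranks.append(np.argsort(np.argsort(-scores)))  # rank 0 = best feature

mean_rank = np.mean(ranks, axis=0)       # combination step
k = int(0.25 * X.shape[1])               # thresholding step: keep top 25%
selected = np.argsort(mean_rank)[:k]
print("selected features:", sorted(selected.tolist()))
```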

    CUDA-JMI: Acceleration of feature selection on heterogeneous systems

    ©2019 Elsevier B.V. All rights reserved. This manuscript version is made available under the CC-BY-NC-ND 4.0 license https://creativecommons.org/licenses/by-nc-nd/4.0/. This version of the article has been accepted for publication in Future Generation Computer Systems. The Version of Record is available online at https://doi.org/10.1016/j.future.2019.08.031. Final accepted version of: J. González-Domínguez, R. R. Expósito, and V. Bolón-Canedo, "CUDA-JMI: Acceleration of feature selection on heterogeneous systems", Future Generation Computer Systems, Vol. 102, pp. 426-436, Jan. 2020, https://doi.org/10.1016/j.future.2019.08.031
    [Abstract] Feature selection is nowadays a crucial step in machine learning and data analytics to remove irrelevant and redundant characteristics and thus provide fast and reliable analyses. Many research works have focused on developing new methods that increase the global relevance of the subset of selected features while reducing the redundancy of information. However, those methods that select features with high relevance and low redundancy are extremely time-consuming when processing large datasets. In this work we present CUDA-JMI, a tool based on Joint Mutual Information that accelerates feature selection by exploiting the computational capabilities of modern heterogeneous systems containing several CPU cores and GPU devices. The experimental evaluation has been carried out on three systems with different types and numbers of CPUs and GPUs, using five publicly available datasets from different fields. The results show that CUDA-JMI is significantly faster than its original sequential counterpart for all systems and input datasets. For instance, the runtime of CUDA-JMI is up to 52 times faster than an existing sequential JMI-based implementation on a machine with 24 CPU cores and two NVIDIA M60 boards (four GPUs). CUDA-JMI is publicly available to download from https://sourceforge.net/projects/cuda-jmi
    This research has been partially funded by projects TIN2016-75845-P and TIN-2015-65069-C2-1-R of the Ministry of Economy, Industry and Competitiveness of Spain, as well as by Xunta de Galicia projects ED431D R2016/045, ED431G/01 and GRC2014/035, all of them partially funded by FEDER funds of the European Union.
    Xunta de Galicia; ED431D R2016/045
    Xunta de Galicia; ED431G/01
    Xunta de Galicia; GRC2014/035
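
    CUDA-JMI itself is a CUDA/C++ tool; the greedy Joint Mutual Information criterion it accelerates can be sketched in plain Python for discretized data as follows (this is the generic JMI selection rule, not the tool's code):

```python
# Plain-Python sketch of the greedy JMI rule that CUDA-JMI parallelizes: at
# each step, add the feature f maximizing sum over selected s of I(X_f, X_s; Y).
# Assumes discretized integer features; this is not the tool's CUDA code.
import math
from collections import Counter

import numpy as np

def mutual_info(a, b):
    """I(A; B) in bits for discrete sequences (elements may be tuples)."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log2(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())

def jmi_select(X, y, k):
    y = list(y)
    cols = [list(col) for col in X.T]             # features as hashable lists
    selected = [max(range(len(cols)), key=lambda f: mutual_info(cols[f], y))]
    while len(selected) < k:
        def jmi_score(f):
            # I(X_f, X_s ; Y): pair the two features into one joint variable
            return sum(mutual_info(list(zip(cols[f], cols[s])), y)
                       for s in selected)
        candidates = [f for f in range(len(cols)) if f not in selected]
        selected.append(max(candidates, key=jmi_score))
    return selected

X = np.random.default_rng(0).integers(0, 3, size=(100, 8))
y = (X[:, 0] + X[:, 3]) % 2
print(jmi_select(X, y, k=3))                      # likely includes features 0 and 3
```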

    How Important Is Data Quality? Best Classifiers vs Best Features

    Funded for open access publication: Universidade da Coruña/CISUG
    [Abstract] The task of choosing the appropriate classifier for a given scenario is not an easy one. First, there is an increasingly large number of available algorithms belonging to different families, and there is also a lack of methodologies that can help recommend in advance a given family of algorithms for a certain type of dataset. Besides, most of these classification algorithms exhibit performance degradation when faced with datasets containing irrelevant and/or redundant features. In this work we analyze the impact of feature selection on classification over several synthetic and real datasets. The experimental results show that the significance of selecting a classifier decreases after applying an appropriate preprocessing step; not only does this ease the choice, but it also improves the results in almost all the datasets tested.
    This work has been supported by the National Plan for Scientific and Technical Research and Innovation of the Spanish Government (Grant PID2019-109238GB-C2), and by the Xunta de Galicia (Grant ED431C 2018/34) with the European Union ERDF funds. CITIC, as a Research Center accredited by the Galician University System, is funded by the Consellería de Cultura, Educación e Universidades from Xunta de Galicia, supported 80% through ERDF Funds (ERDF Operational Programme Galicia 2014–2020) and the remaining 20% by the Secretaría Xeral de Universidades (Grant ED431G 2019/01). Funding for open access charge: Universidade da Coruña/CISUG
    Xunta de Galicia; ED431C 2018/34
    Xunta de Galicia; ED431G 2019/01
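
    A minimal sketch of the kind of comparison reported above: the spread in cross-validated accuracy across several classifiers, with and without a feature selection step. The three classifiers, the mutual-information filter and the k=10 cut are illustrative assumptions:

```python
# Sketch of the paper's comparison: accuracy spread across classifiers with
# and without a feature selection preprocessing step. Component choices are
# illustrative, not the paper's exact experimental setup.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)
classifiers = [GaussianNB(), KNeighborsClassifier(), RandomForestClassifier(random_state=0)]

for select in (False, True):
    accs = []
    for clf in classifiers:
        model = make_pipeline(SelectKBest(mutual_info_classif, k=10), clf) if select else clf
        accs.append(cross_val_score(model, X, y, cv=5).mean())
    print("with FS" if select else "no FS  ",
          "spread between best and worst classifier: %.3f" % (max(accs) - min(accs)))
```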

    Feature Selection in Big Image Datasets

    [Abstract] In computer vision, current feature extraction techniques generate high-dimensional data. Both convolutional neural networks and traditional approaches such as keypoint detectors are used as extractors of high-level features. However, the resulting datasets have grown in the number of features, leading to long training times due to the curse of dimensionality. In this research, several feature selection methods were applied to these image features using big data technologies. Additionally, we analyzed how image resolution may affect the extracted features, and the impact of selecting only the most relevant ones. Experimental results show that a substantial reduction of the extracted features provides classification results similar to those obtained with the full feature set and, in some cases, outperforms the results achieved using broad feature vectors.
    This research has been financially supported in part by European Union FEDER funds, by the Spanish Ministerio de Economía y Competitividad (research project PID2019-109238GB), by the Consellería de Industria of the Xunta de Galicia (research project GRC2014/035), and by the Principado de Asturias Regional Government (research project IDI-2018-000176). CITIC, as a Research Centre of the Galician University System, is financed by the Consellería de Educación, Universidades e Formación Profesional (Xunta de Galicia) through the ERDF (80%, Operational Programme ERDF Galicia 2014–2020) and the remaining 20% by the Secretaría Xeral de Universidades (ref. ED431G 2019/01).
    Xunta de Galicia; GRC2014/035
    Gobierno del Principado de Asturias; IDI-2018-000176
    Xunta de Galicia; ED431G 2019/01

    Low-Precision Feature Selection on Microarray Data: An Information Theoretic Approach

    Funded for open access publication: Universidade da Coruña/CISUG
    [Abstract] The number of interconnected devices surrounding us every day, such as personal wearables, cars, and smart homes, has recently increased. Internet of Things devices monitor many processes and have the capacity to use machine learning models for pattern recognition, and even to make decisions, with the added advantage of diminishing network congestion by allowing computations near the data sources. The main restriction is the low computational capacity of these devices. Thus, machine learning algorithms are needed that maintain accuracy while using mechanisms that exploit certain characteristics, such as low-precision versions. In this paper, low-precision mutual-information-based feature selection algorithms are applied to DNA microarray datasets, showing that 16-bit and sometimes even 8-bit representations of these algorithms can be used without significant variations in the final classification results.
    This work has been supported by the grant Machine Learning on the Edge - Ayudas Fundación BBVA a Equipos de Investigación Científica 2019. It has also been possible thanks to the support received from the National Plan for Scientific and Technical Research and Innovation of the Spanish Government (Grant PID2019-109238GB-C2), and from the Xunta de Galicia (Grant ED431C 2018/34) with the European Union ERDF funds. CITIC, as a Research Center accredited by the Galician University System, is funded by the Consellería de Cultura, Educación e Universidades from Xunta de Galicia, supported 80% through ERDF Funds (ERDF Operational Programme Galicia 2014-2020) and the remaining 20% by the Secretaría Xeral de Universidades (Grant ED431G 2019/01). Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
    Xunta de Galicia; ED431C 2018/34
    Xunta de Galicia; ED431G 2019/01
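
    As a rough illustration of the effect measured above (not the authors' implementation), the sketch below computes mutual information with probabilities rounded to an 8-bit or 16-bit fixed-point grid and compares the result against the double-precision value:

```python
# Rough illustration of the effect the paper measures: mutual information with
# probabilities rounded to an 8- or 16-bit fixed-point grid, versus the
# double-precision value. Not the authors' implementation.
import math
from collections import Counter

def mutual_info(a, b, frac_bits=None):
    n = len(a)
    def p(count):
        prob = count / n
        if frac_bits is None:
            return prob                            # double precision
        scale = 1 << frac_bits
        return round(prob * scale) / scale         # quantized probability
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    total = 0.0
    for (x, y), c in pab.items():
        pxy, px, py = p(c), p(pa[x]), p(pb[y])
        if pxy > 0 and px > 0 and py > 0:
            total += pxy * math.log2(pxy / (px * py))
    return total

a = [0, 0, 1, 1, 2, 2, 0] * 30
b = [0, 1, 1, 1, 0, 0, 0] * 30
for bits, label in ((None, "float64"), (16, "16-bit"), (8, "8-bit")):
    print(label, round(mutual_info(a, b, bits), 4))  # values differ only slightly
```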

    On the Effectiveness of Convolutional Autoencoders on Image-Based Personalized Recommender Systems

    [Abstract] Over the years, the success of recommender systems has become remarkable. Due to the massive number of options within a consumer's reach, a collaborative environment has emerged in which users from all over the world seek and share opinions on all types of products. In particular, millions of images tagged with users' tastes are available on the web. The application of deep learning techniques to these tasks has therefore become a key issue, and there is growing interest in using images to solve them, particularly through feature extraction. This work explores the potential of using only images as sources of information for modeling users' tastes and proposes a method to provide gastronomic recommendations based on them. To achieve this, we focus on the pre-processing and encoding of the images, proposing the use of a pre-trained convolutional autoencoder as feature extractor. We compare our method with the standard approach of using convolutional neural networks and study the effect of applying transfer learning, showing that in this case it is better to use only the specific knowledge of the target domain, even if fewer examples are available.
    This research has been financially supported in part by European Union FEDER funds, by the Spanish Ministerio de Economía y Competitividad (research project PID2019-109238GB), by the Consellería de Industria of the Xunta de Galicia (research project GRC2014/035), and by the Principado de Asturias Regional Government (research project IDI-2018-000176). CITIC, as a Research Center of the Galician University System, is financed by the Consellería de Educación, Universidades e Formación Profesional (Xunta de Galicia) through the ERDF (80%, Operational Programme ERDF Galicia 2014–2020) and the remaining 20% by the Secretaría Xeral de Universidades (ref. ED431G 2019/01).
    Xunta de Galicia; GRC2014/035
    Gobierno del Principado de Asturias; IDI-2018-000176
    Xunta de Galicia; ED431G 2019/01
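
    A minimal PyTorch sketch of the pattern proposed above: train a convolutional autoencoder on images, then keep the encoder alone as a feature extractor. The architecture and sizes are illustrative assumptions, not the paper's:

```python
# Minimal sketch: train a convolutional autoencoder for reconstruction, then
# use only the encoder's bottleneck as image features. Layer sizes are
# illustrative, not the paper's architecture.
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),   # 64x64 -> 32x32
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 32x32 -> 16x16
    nn.ReLU(),
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2),    # 16x16 -> 32x32
    nn.ReLU(),
    nn.ConvTranspose2d(16, 3, kernel_size=2, stride=2),     # 32x32 -> 64x64
    nn.Sigmoid(),
)

autoencoder = nn.Sequential(encoder, decoder)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
images = torch.rand(8, 3, 64, 64)            # stand-in batch of images

for _ in range(5):                           # reconstruction training loop
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(autoencoder(images), images)
    loss.backward()
    optimizer.step()

with torch.no_grad():                        # encoder output = image features
    features = encoder(images).flatten(1)    # shape: (8, 32*16*16)
print(features.shape)
```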